{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "SqIxQYm8h06Z"
   },
   "source": [
    "# Ungraded Lab: Generating Text from Irish Lyrics\n",
    "\n",
    "In the previous lab, you trained a model on just a single song. You might have found that the output text can quickly become gibberish or repetitive. Even if you tweak the parameters, the model will still be limited by its vocabulary of only a few hundred words. The model will be more flexible if you train it on a much larger corpus and that's what you'll be doing in this lab. You will use lyrics from more Irish songs then see how the generated text looks like. You will also see how this impacts the process from data preparation to model training. Let's get started!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wb1mfOvch4Sv"
   },
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "BOwsuGQQY9OL"
   },
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "jmBFI788pOXx"
   },
   "source": [
    "## Building the Word Vocabulary\n",
    "\n",
    "You will first download the lyrics dataset. These will be from a compilation of traditional Irish songs and you can see them [here](https://github.com/https-deeplearning-ai/tensorflow-1-public/blob/main/C3/W4/misc/Laurences_generated_poetry.txt)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pylt5qZYsWPh"
   },
   "outputs": [],
   "source": [
    "# The dataset has already beed downloaded for you, so no need to run the following line of code.\n",
    "# !wget https://storage.googleapis.com/tensorflow-1-public/course3/irish-lyrics-eof.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-v6JYQGNPXCW"
   },
   "source": [
    "Next, you will lowercase and split the plain text into a list of sentences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "LKOO7DFCPX3L",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "data = open('./irish-lyrics-eof.txt').read()\n",
    "\n",
    "# Lowercase and split the text\n",
    "corpus = data.lower().split(\"\\n\")\n",
    "\n",
    "# Preview the result\n",
    "print(corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EkP2CP0qP8RD"
   },
   "source": [
    "From here, you can initialize the `TextVectorization` class and generate the vocabulary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "PRnDnCW-Z7qv"
   },
   "outputs": [],
   "source": [
    "# Initialize the vectorization layer\n",
    "vectorize_layer = tf.keras.layers.TextVectorization()\n",
    "\n",
    "# Build the vocabulary\n",
    "vectorize_layer.adapt(corpus)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "oQb4sCPJ1a9N",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Get the vocabulary and its size\n",
    "vocabulary = vectorize_layer.get_vocabulary()\n",
    "vocab_size = len(vocabulary)\n",
    "\n",
    "print(f'{vocabulary}')\n",
    "print(f'{vocab_size}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "JK29FzZ7QW-4"
   },
   "source": [
    "## Preprocessing the Dataset\n",
    "\n",
    "Next, you will generate the inputs and labels for your model. The process will be identical to the previous lab. The `xs` or inputs to the model will be padded sequences, while the `ys` or labels are one-hot encoded arrays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "soPGVheskaQP"
   },
   "outputs": [],
   "source": [
    "# Initialize the sequences list\n",
    "input_sequences = []\n",
    "\n",
    "# Loop over every line\n",
    "for line in corpus:\n",
    "\n",
    "\t# Generate the integer sequence of the current line\n",
    "\tsequence = vectorize_layer(line).numpy()\n",
    "\n",
    "\t# Loop over the line several times to generate the subphrases\n",
    "\tfor i in range(1, len(sequence)):\n",
    "\n",
    "\t\t# Generate the subphrase\n",
    "\t\tn_gram_sequence = sequence[:i+1]\n",
    "\n",
    "\t\t# Append the subphrase to the sequences list\n",
    "\t\tinput_sequences.append(n_gram_sequence)\n",
    "\n",
    "# Get the length of the longest line\n",
    "max_sequence_len = max([len(x) for x in input_sequences])\n",
    "\n",
    "# Pad all sequences\n",
    "input_sequences = np.array(tf.keras.utils.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))\n",
    "\n",
    "# Create inputs and label by splitting the last token in the subphrases\n",
    "xs, labels = input_sequences[:,:-1],input_sequences[:,-1]\n",
    "\n",
    "# Convert the label into one-hot arrays\n",
    "ys = tf.keras.utils.to_categorical(labels, num_classes=vocab_size)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "TmWHCO0dQGlZ"
   },
   "source": [
    "You can then print some of the examples as a sanity check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "pJtwVB2NbOAP"
   },
   "outputs": [],
   "source": [
    "# Get sample sentence\n",
    "sentence = corpus[0].split()\n",
    "print(f'sample sentence: {sentence}')\n",
    "\n",
    "# Initialize token list\n",
    "token_list = []\n",
    "\n",
    "# Look up the indices of each word and append to the list\n",
    "for word in sentence:\n",
    "  token_list.append(vocabulary.index(word))\n",
    "\n",
    "# Print the token list\n",
    "print(token_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "etXdbu-l2mxD"
   },
   "outputs": [],
   "source": [
    "def sequence_to_text(sequence, vocabulary):\n",
    "  '''utility to convert integer sequence back to text'''\n",
    "\n",
    "  # Loop through the integer sequence and look up the word from the vocabulary\n",
    "  words = [vocabulary[index] for index in sequence]\n",
    "\n",
    "  # Combine the words into one sentence\n",
    "  text = tf.strings.reduce_join(words, separator=' ').numpy().decode()\n",
    "\n",
    "  return text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "lMr6kKfzROlW"
   },
   "outputs": [],
   "source": [
    "# Pick element\n",
    "elem_number = 5\n",
    "\n",
    "# Print token list and phrase\n",
    "print(f'token list: {xs[elem_number]}')\n",
    "print(f'decoded to text: {sequence_to_text(xs[elem_number], vocabulary)}')\n",
    "\n",
    "# Print label\n",
    "print(f'one-hot label: {ys[elem_number]}')\n",
    "print(f'index of label: {np.argmax(ys[elem_number])}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "49Cv68JOakwv"
   },
   "outputs": [],
   "source": [
    "# Pick element\n",
    "elem_number = 4\n",
    "\n",
    "# Print token list and phrase\n",
    "print(f'token list: {xs[elem_number]}')\n",
    "print(f'decoded to text: {sequence_to_text(xs[elem_number], vocabulary)}')\n",
    "\n",
    "# Print label\n",
    "print(f'one-hot label: {ys[elem_number]}')\n",
    "print(f'index of label: {np.argmax(ys[elem_number])}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lastly, since this is a larger dataset, you can use the tf.data API to speed up the training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE\n",
    "BATCH_SIZE = 32\n",
    "\n",
    "# Put the inputs and labels to a tf.data.Dataset\n",
    "dataset = tf.data.Dataset.from_tensor_slices((xs,ys))\n",
    "\n",
    "# Optimize the dataset for training\n",
    "dataset = dataset.cache().prefetch(PREFETCH_BUFFER_SIZE).batch(BATCH_SIZE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "VKWWUZm5VPG9"
   },
   "source": [
    "## Build and compile the Model\n",
    "\n",
    "Next, you will build and compile the model. We placed some of the hyperparameters at the top of the code cell so you can easily tweak it later if you want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "w9vH8Y59ajYL"
   },
   "outputs": [],
   "source": [
    "# Parameters\n",
    "embedding_dim = 100\n",
    "lstm_units = 150\n",
    "learning_rate = 0.01\n",
    "\n",
    "# Build the model\n",
    "model = tf.keras.models.Sequential([\n",
    "            tf.keras.Input(shape=(max_sequence_len-1,)),\n",
    "            tf.keras.layers.Embedding(vocab_size, embedding_dim),\n",
    "            tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units)),\n",
    "            tf.keras.layers.Dense(vocab_size, activation='softmax')\n",
    "])\n",
    "\n",
    "# Use categorical crossentropy because this is a multi-class problem\n",
    "model.compile(\n",
    "    loss='categorical_crossentropy',\n",
    "    optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),\n",
    "    metrics=['accuracy']\n",
    "    )\n",
    "\n",
    "# Print the model summary\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "OpI0d9cfR43c"
   },
   "source": [
    "## Train the model\n",
    "\n",
    "From the model summary above, you'll notice that the number of trainable params is much larger than the one in the previous lab. Consequently, that usually means a slower training time. It will take roughly 7 seconds per epoch with the GPU enabled in Colab and you'll reach around 76% accuracy after 100 epochs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "Nc4zC7C4jJpN",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "epochs = 100\n",
    "\n",
    "# Train the model\n",
    "history = model.fit(dataset, epochs=epochs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WgAzLnLATFts"
   },
   "source": [
    "You can visualize the accuracy below to see how it fluctuates as the training progresses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "3YXGelKThoTT"
   },
   "outputs": [],
   "source": [
    "# Plot utility\n",
    "def plot_graphs(history, string):\n",
    "  plt.plot(history.history[string])\n",
    "  plt.xlabel(\"Epochs\")\n",
    "  plt.ylabel(string)\n",
    "  plt.show()\n",
    "\n",
    "# Visualize the accuracy\n",
    "plot_graphs(history, 'accuracy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "9gxKIcvGTUnw"
   },
   "source": [
    "## Generating Text\n",
    "\n",
    "Now you can let the model make its own songs or poetry! Because it is trained on a much larger corpus, the results below should contain less repetitions as before. The code below picks the next word based on the highest probability output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "6Vc6PHgxa6Hm"
   },
   "outputs": [],
   "source": [
    "# Define seed text\n",
    "seed_text = \"help me obi-wan kenobi youre my only hope\"\n",
    "\n",
    "# Define total words to predict\n",
    "next_words = 100\n",
    "\n",
    "# Loop until desired length is reached\n",
    "for _ in range(next_words):\n",
    "\n",
    "\t# Generate the integer sequence of the current line\n",
    "\tsequence = vectorize_layer(seed_text)\n",
    "\n",
    "\t# Pad the sequence\n",
    "\tsequence = tf.keras.utils.pad_sequences([sequence], maxlen=max_sequence_len-1, padding='pre')\n",
    "\n",
    "\t# Feed to the model and get the probabilities for each index\n",
    "\tprobabilities = model.predict(sequence, verbose=0)\n",
    "\n",
    "\t# Get the index with the highest probability\n",
    "\tpredicted = np.argmax(probabilities, axis=-1)[0]\n",
    "\n",
    "\t# Ignore if index is 0 because that is just the padding.\n",
    "\tif predicted != 0:\n",
    "\n",
    "\t\t# Look up the word associated with the index.\n",
    "\t\toutput_word = vocabulary[predicted]\n",
    "\n",
    "\t\t# Combine with the seed text\n",
    "\t\tseed_text += \" \" + output_word\n",
    "\n",
    "# Print the result\n",
    "print(seed_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "wHtrtAFAT6tn"
   },
   "source": [
    "Here again is the code that gets the top 3 predictions and picks one at random."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "yJfzKm-8mVKD"
   },
   "outputs": [],
   "source": [
    "# Define seed text\n",
    "seed_text = \"help me obi-wan kenobi youre my only hope\"\n",
    "\n",
    "# Define total words to predict\n",
    "next_words = 100\n",
    "\n",
    "# Loop until desired length is reached\n",
    "for _ in range(next_words):\n",
    "\n",
    "\t# Convert the seed text to an integer sequence\n",
    "  sequence = vectorize_layer(seed_text)\n",
    "\n",
    "\t# Pad the sequence\n",
    "  sequence = tf.keras.utils.pad_sequences([sequence], maxlen=max_sequence_len-1, padding='pre')\n",
    "\n",
    "\t# Feed to the model and get the probabilities for each index\n",
    "  probabilities = model.predict(sequence, verbose=0)\n",
    "\n",
    "  # Pick a random number from [1,2,3]\n",
    "  choice = np.random.choice([1,2,3])\n",
    "\n",
    "  # Sort the probabilities in ascending order\n",
    "  # and get the random choice from the end of the array\n",
    "  predicted = np.argsort(probabilities)[0][-choice]\n",
    "\n",
    "\t# Ignore if index is 0 because that is just the padding.\n",
    "  if predicted != 0:\n",
    "\n",
    "    # Look up the word associated with the index.\n",
    "    output_word = vocabulary[predicted]\n",
    "\n",
    "    # Combine with the seed text\n",
    "    seed_text += \" \" + output_word\n",
    "\n",
    "# Print the result\n",
    "print(seed_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "DP0--sdMUJ_k"
   },
   "source": [
    "## Wrap Up\n",
    "\n",
    "This lab shows the effect of having a larger dataset to train your text generation model. As expected, this will take a longer time to prepare and train but the output will less likely become repetitive or gibberish. Try to tweak the hyperparameters and see if you get better results. You can also find some other text datasets and use it to train the model here.  "
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "private_outputs": true,
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0rc1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
